Y. Xia, J. Sun Bioinformatic and Statistical Analysis of Microbiome Data https://doi.org/10.1007/978-3-031-21391-5_1

1. Introduction to QIIME 2

Yinglin Xia¹ and Jun Sun¹

(1) Department of Medicine, University of Illinois Chicago, Chicago, IL, USA

Abstract

This chapter describes the foundations of the QIIME 2 approach for bioinformatic and biostatistical analyses of microbiome data. It first provides an overview of QIIME 2, and then introduces the core concepts in QIIME 2 and the installation of QIIME 2. Next it introduces how to store, track, and extract data in QIIME 2.

Keywords

QIIME, QIIME 2, DADA2, Deblur, Artifacts, Visualizations, Semantic type, Plugins, QIIME 2 archives, q2-dada2, q2-feature-table, q2-types

In this chapter, we provide the foundations of the QIIME 2 approach to bioinformatic and statistical analysis of microbiome data. We first provide an overview of QIIME 2 (Sect. 1.1). Then we introduce the core concepts in QIIME 2 (Sect. 1.2). Next, we introduce how to install QIIME 2 (Sect. 1.3). Following that, we introduce how to store and track data in QIIME 2 (Sect. 1.4) and how to extract data from QIIME 2 (Sect. 1.5). We complete this chapter with a brief summary (Sect. 1.6).

1.1 Overview of QIIME 2

QIIME (canonically pronounced chime: Quantitative Insights Into Microbial Ecology) is an open source microbiome bioinformatics platform designed for analyzing microbial ecological communities (J. G. Caporaso et al. 2010). It has been used in many microbiome studies, including analyses of bacterial, archaeal, fungal, and viral sequence data. QIIME analysis generally starts with raw sequence data (in FASTA format) generated from any sequencing technology. QIIME is implemented as a collection of command-line scripts, most of which wrap other software packages, designed to take users from raw sequence data and sample metadata through publication-quality graphics and statistics (Xia et al. 2018); it can therefore analyze high-throughput data in a wide variety of ways. QIIME 2 succeeded QIIME 1 on January 1, 2018, and QIIME 1 has not been supported since the end of 2017. QIIME 2 (Bolyen et al. 2019) is not merely an update of QIIME 1 but a completely redesigned, reengineered, and rewritten system that aims to facilitate reproducible and modular analysis of microbiome data.

For users who adopt QIIME 2 as an analytical platform for microbiome data, the following three functionalities matter most:

(1) QIIME 2 is a bioinformatic analysis tool that wraps other tools, such as DADA2 (Callahan et al. 2016) and Deblur (Amir et al. 2017), to perform data generation, sequence quality control, taxonomy assignment, phylogenetic insertion, and other functions. This topic is a core part of this book and is covered in depth in Chaps. 3, 4, 5, and 6.
(2) QIIME 2 can also perform basic statistical analyses of microbiome data, such as alpha- and beta-diversity analyses. This content is introduced in Chaps. 9, 10, 11, 14, and 15.
(3) QIIME 2 can also be used to manage and visualize microbiome data and metadata (see Sect. 3.5) and to produce Emperor plots (see Sect. 10.4.2).

The QIIME 2 system architecture consists of three core components: the framework, the interfaces, and the plugins. Interfaces (q2cli, q2studio, and the Artifact API) define how users interact with the system; plugins (such as q2-dada2, q2-feature-table, and q2-types) define all domain-specific functionality. Interfaces and plugins do not communicate directly with one another; the framework mediates communication between them and performs core functionality such as provenance tracking. For details of how the QIIME 2 system works, the interested reader can consult the QIIME 2 website and Bolyen et al. (2019).

QIIME is a Python interface that combines many independent scripts for the analysis of microbiome data, allowing data to be analyzed all the way from raw sequence data to diversity indices and taxonomic breakdowns. QIIME 2 has several distinctive features that make it powerful and distinguish it from other open source software tools for microbiome data science.

First, QIIME 2 wraps data generation tools into plugins (software packages). Continuing the development of QIIME 1, QIIME 2 is built on a plugin architecture that wraps other bioinformatics tools, which lets it analyze high-throughput data in various ways, such as barcode splitting, cleaning and filtering low-quality sequences, removing chimeric sequences via USEARCH 6.1 (Edgar 2010), PyNAST alignment (J. Gregory Caporaso et al. 2009), taxonomic analysis, tree building, and clustering samples. Currently QIIME 2 wraps DADA2 and Deblur (the latest generation tools) for sequence quality control, alongside plugins for taxonomy assignment and phylogenetic insertion. Among a number of bioinformatics tools, QIIME and mothur (Schloss et al. 2009) were reviewed as the two outstanding pipelines (Nilakanta et al. 2014) because of their comprehensive features and supporting documentation. Since the release of the reengineered and rewritten system, QIIME 2 has grown steadily in popularity and has become the dominant tool for 16S microbiome data analysis.

Second, QIIME 2 provides basic statistical analyses, including alpha- and beta-diversity analyses, and supports qualitatively new functionality, such as paired-sample and time-series analysis, machine learning, compositional analysis, and gneiss analysis.

Third, QIIME 2 not only provides “upstream” processing steps (e.g., sequence demultiplexing and quality control) for generating data, but also provides many interactive visualization tools to facilitate exploratory analyses and publication-quality graphics for reporting results. Through QIIME 2 View (https://view.qiime2.org), users can also securely share and interact with results without installing QIIME 2.

Finally, QIIME 2 is not only a powerful marker gene analysis tool but also has the potential to serve as a multidimensional data science platform for multi-omic microbiome studies. For example, newly released plugins, including q2-cscs, q2-metabolomics, q2-shogun, q2-metaphlan2, and q2-picrust2, integrate other data types, such as metabolite or metatranscriptome profiles, and allow users to run tools such as MetaPhlAn2 through QIIME 2. QIIME 2 can thus perform initial analyses of metabolomics and shotgun metagenomics data and can be rapidly adapted to analyze diverse microbiome features and multi-omics data.

1.2 Core Concepts in QIIME 2

In QIIME 2, an Action (i.e., a method, visualizer, or pipeline) creates a Result, and a Result can be either an Artifact or a Visualization. Each execution of an Action and every Result created by an Action is assigned a version 4 universally unique identifier (UUID). There are five core concepts in QIIME 2: artifacts, visualizations, semantic types, plugins, and the methods and visualizers that plugins define. We describe them below.
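As a minimal sketch of the identifier scheme, the following Python snippet assigns a version 4 UUID to one execution of an Action and to each Result it creates. The Result and Execution classes and the new_uuid helper are hypothetical stand-ins written for this illustration; they are not part of the QIIME 2 framework.

```python
import uuid
from dataclasses import dataclass, field


def new_uuid() -> str:
    """Return a random (version 4) UUID as a string."""
    return str(uuid.uuid4())


@dataclass
class Result:
    """Hypothetical stand-in for a QIIME 2 Result (Artifact or Visualization)."""
    kind: str  # "Artifact" or "Visualization"
    uuid: str = field(default_factory=new_uuid)


@dataclass
class Execution:
    """Hypothetical record of one execution of an Action."""
    action_name: str
    uuid: str = field(default_factory=new_uuid)


run = Execution("rarefy")
table = Result("Artifact")
plot = Result("Visualization")

# Every identifier is distinct and parses as a version 4 UUID.
ids = [run.uuid, table.uuid, plot.uuid]
print(len(set(ids)) == 3)
print(all(uuid.UUID(i).version == 4 for i in ids))
```

Because version 4 UUIDs are generated from random bits, two Results can be told apart without any central registry, which is what allows an identifier to serve as the identity of a file that is copied or shared.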

1.2.1 Artifacts

Artifacts Are Data Files

In QIIME 2, artifacts are defined as data files. The term is used by analogy with an archaeological artifact, that is, an object that was made by some process: an Artifact is data generated by one or more Actions. More formally, artifacts are the form in which data produced by QIIME 2 exist. An artifact contains data and metadata. The metadata describe the data's type, format, and provenance (i.e., how it was generated). QIIME 2 typically uses the .qza file extension to store an artifact in a file. QIIME 2 does not work with simple data files (e.g., FASTA files) directly; instead, it works with artifacts. Using artifacts instead of simple data files has two benefits: (1) QIIME 2 automatically tracks the type, format, and provenance of the data, and (2) researchers can focus on conducting analyses instead of on particular data formats. Because QIIME 2 works directly with artifacts, imported data must first be turned into a QIIME 2 artifact. Typically we start by importing raw sequence data, although data can be imported at any step of an analysis. Data can also be exported from an artifact.

1.2.2 Visualizations

Visualizations Are also Data Files

Visualizations are another type of data generated by QIIME 2. QIIME 2 typically uses the .qzv file extension to store visualization files. Visualizations contain similar types of metadata as artifacts, including provenance information. Because both artifacts and visualizations include metadata, and in particular unique provenance information, they can be archived or shared with collaborators. Both artifact and visualization files (generally with .qza and .qzv extensions, respectively) can be easily reviewed on the QIIME 2 View website (https://view.qiime2.org) without requiring a QIIME 2 installation. Examples of visualizations include a statistical results table, an interactive visualization, static images, or really any combination of visual data representations. In contrast to artifacts, all of these are terminal outputs of an analysis. Thus, we cannot use visualizations as input to other analyses in QIIME 2.

1.2.3 Semantic Type

Every Artifact Has a Semantic Type Associated with It

In QIIME 2, all Artifacts are annotated with a semantic description of their type. Data types describe how data are represented in memory, and file formats describe how data are stored on disk; semantic types differ from both in that they convey the meaning of the data. Semantic types serve three purposes. (1) They identify which artifacts are suitable inputs to an analysis, which prevents incompatible artifacts from being used and effectively constrains the available actions to those that are semantically meaningful. For example, in QIIME 2, phylogenetic trees have two semantic types: Phylogeny[Rooted] and Phylogeny[Unrooted]. The beta-phylogenetic action computes UniFrac distances, which requires a Phylogeny[Rooted], while fasttree can only generate a Phylogeny[Unrooted]. Thus the semantic type system determines that the output of fasttree should not be provided directly as input to beta-phylogenetic. (2) They help users avoid using data in semantically incorrect analyses. For example, categorical data cannot be used in a quantitative analysis, and vice versa. (3) They help users identify relevant workflows to generate desired data or to explore data further.
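The rooted versus unrooted tree example can be sketched as a simple compatibility check. This is only an illustration of the idea, not how the QIIME 2 framework implements semantic types: the Artifact class, the ACCEPTED_TREE_TYPE registry, and the check_input function are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """Hypothetical artifact: a name plus a semantic type string."""
    name: str
    semantic_type: str


# Hypothetical registry: the semantic type each action accepts for its tree input.
ACCEPTED_TREE_TYPE = {
    "beta-phylogenetic": "Phylogeny[Rooted]",
}


def check_input(action: str, artifact: Artifact) -> None:
    """Raise TypeError if the artifact's semantic type does not suit the action."""
    expected = ACCEPTED_TREE_TYPE[action]
    if artifact.semantic_type != expected:
        raise TypeError(
            f"{action} expects {expected}, got {artifact.semantic_type}"
        )


unrooted = Artifact("fasttree-output", "Phylogeny[Unrooted]")
rooted = Artifact("rooted-tree", "Phylogeny[Rooted]")

check_input("beta-phylogenetic", rooted)  # accepted, no exception

try:
    check_input("beta-phylogenetic", unrooted)
    rejected = False
except TypeError as err:
    rejected = True
    print(err)
```

The check happens before any computation starts, so an unsuitable input is reported immediately instead of producing a misleading result later in the analysis.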

1.2.4 Plugins

Microbiome Analyses Are Implemented via Plugins

QIIME 2 implements microbiome analyses via plugins (software packages). Thus, at least one plugin needs to be installed to provide a specific analysis, such as the q2-demux plugin for demultiplexing raw sequence data or the q2-diversity plugin for alpha- or beta-diversity analyses. The initial plugins for an end-to-end microbiome analysis pipeline were developed by the QIIME 2 team; other plugins for additional analyses are developed by third-party developers and volunteers.

Plugins Define Methods and Visualizers to Perform Analyses

To perform microbiome analyses, QIIME 2 plugins must define methods and visualizers. A method takes a combination of artifacts and parameters as input and produces one or more artifacts as output. The resulting output artifacts can subsequently be used as input to other methods or visualizers, so the outputs of methods can be intermediate or terminal. For example, in the q2-feature-table plugin, the rarefy method takes a feature table artifact and a sampling depth as input and produces a rarefied feature table artifact as output. We could then pass the rarefied feature table artifact to the alpha method in the q2-diversity plugin to calculate alpha diversity. A visualizer is similar to a method in that it takes a combination of artifacts and parameters as input. However, a visualizer differs from a method in that it produces exactly one terminal visualization as output; that is, the resulting visualization cannot be used as input to other methods or visualizers.
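To make the chaining of methods concrete, here is a small Python sketch in which the output of one function feeds the next, in the spirit of rarefy followed by alpha. It is an illustration under stated assumptions, not the code of q2-feature-table or q2-diversity: the table is a plain dictionary of made-up counts, rarefaction is done by subsampling reads without replacement, samples below the sampling depth are dropped, and Shannon entropy with base 2 logarithms is assumed as the alpha diversity metric.

```python
import math
import random
from collections import Counter


def rarefy(table, sampling_depth, seed=0):
    """Subsample each sample to `sampling_depth` reads without replacement.

    `table` maps sample id -> {feature id: count}. Samples with fewer than
    `sampling_depth` reads are dropped.
    """
    rng = random.Random(seed)
    rarefied = {}
    for sample_id, counts in table.items():
        # Expand counts into one entry per read, e.g. {"a": 2} -> ["a", "a"].
        reads = [feature for feature, n in counts.items() for _ in range(n)]
        if len(reads) < sampling_depth:
            continue
        rarefied[sample_id] = dict(Counter(rng.sample(reads, sampling_depth)))
    return rarefied


def shannon(counts):
    """Shannon entropy (base 2) of one sample's feature counts."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values() if n > 0)


# Made-up feature table used only for this illustration.
table = {
    "S1": {"f1": 50, "f2": 30, "f3": 20},
    "S2": {"f1": 10, "f2": 5},  # only 15 reads: dropped at depth 40
    "S3": {"f1": 40, "f2": 40},
}

rarefied = rarefy(table, sampling_depth=40)
alpha = {sample_id: shannon(counts) for sample_id, counts in rarefied.items()}
print(sorted(rarefied))
print(alpha)
```

Here rarefied plays the role of an intermediate output and alpha the role of a later output; a visualizer would sit at the end of such a chain and return something that no further step consumes.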

1.3 Install QIIME 2

QIIME 2 is currently available for macOS, Windows, and Linux users. Three installation approaches are available: a native conda installation, Docker, and VirtualBox. The native conda installation usually works well and is the method that QIIME 2 generally recommends.

We can install QIIME 2 natively or use virtual machines. Below we introduce the native installation of QIIME 2 (Core 2022.2 distribution). Readers can follow the QIIME 2 documentation to install QIIME 2 using virtual machines. The QIIME 2 Core 2022.2 distribution includes the QIIME 2 framework, q2cli (a QIIME 2 command-line interface), and the following plugins: q2-alignment, q2-composition, q2-cutadapt, q2-dada2, q2-deblur, q2-demux, q2-diversity, q2-diversity-lib, q2-emperor, q2-feature-classifier, q2-feature-table, q2-fragment-insertion, q2-gneiss, q2-longitudinal, q2-metadata, q2-phylogeny, q2-quality-control, q2-quality-filter, q2-sample-classifier, q2-taxa, q2-types, and q2-vsearch.

Install Miniconda and QIIME 2 on Mac

The recommended way to natively install QIIME 2 is through Miniconda, which provides the conda environment and package manager; QIIME 2 works within a conda environment. To install Miniconda on a Mac, the following steps are needed:

Follow the Miniconda instructions for downloading and installing Miniconda. QIIME 2 works with either Miniconda2 or Miniconda3 (i.e., Miniconda Python 2 or 3), so you may choose either one. In particular, ensure that you run conda init and that your Miniconda installation is complete.

At the time of writing, the latest version is the QIIME 2 Core 2022.2 distribution (February 2022). To install this version, we take the following steps:

Step 1: Click the website link https://docs.conda.io/en/latest/miniconda.html to download Miniconda (64-bit bash installer).

You can choose either Python 3.7 or Python 2.7 for Mac OS X. A file called Miniconda3-latest-MacOSX-x86_64.sh will be shown in your Downloads folder.

Step 2: Open a Mac Terminal (Terminal is within Utilities folder nested in Applications folder). Type: cd Downloads
Step 3: Run the bash “shell” script to install Miniconda. In the terminal, type: bash Miniconda3-latest-MacOSX-x86_64.sh

Press ENTER to scroll through the license, and accept all the default installation options.

Step 4: Close the Terminal, and open a new Terminal. Type: conda -V

If you see “conda 4.10.3” or a later version, you have successfully installed conda via Miniconda on your Mac. To update conda to the latest version, type:

conda update conda

Then, to install wget, type:

conda install wget

Step 5: Install QIIME 2 within the conda environment.

With Miniconda installed, conda is available for creating environments, and we are ready to install QIIME 2 within a conda environment. We natively install the QIIME 2 Core 2022.2 distribution. Because QIIME 2 has many required dependencies that you may not want added to an existing environment, QIIME 2 highly recommends creating a new environment specifically for the QIIME 2 release being installed. Here, we name the environment qiime2-2022.2 to indicate that it holds the QIIME 2 Core 2022.2 distribution. We choose macOS/OS X (64-bit) and, in the Terminal, type:

wget https://data.qiime2.org/distro/core/qiime2-2022.2-py38-osx-conda.yml

conda env create -n qiime2-2022.2 --file qiime2-2022.2-py38-osx-conda.yml

# OPTIONAL CLEANUP

rm qiime2-2022.2-py38-osx-conda.yml

Step 6: Activate the conda environment.

To activate the QIIME 2 environment, pass the environment’s name (here, qiime2-2022.2) to conda activate in the Terminal:

conda activate qiime2-2022.2

If you want to deactivate an environment, run conda deactivate.

Step 7: Test the installation.

To test the installation, activate the QIIME 2 environment and run:

qiime --help

If the help information is displayed when you run this command, the installation was successful.

1.4 Store and Track Data

In QIIME 2, data are stored in a directory structure called an Archive, with a single root directory (named by a UUID) serving as the identity of the archive. Under this root directory are (1) a data directory, (2) a provenance directory, (3) a metadata file (metadata.yaml), and (4) a VERSION file. The data directory contains only the data in a relevant format; for example, FASTA or FASTQ files for sequence data and Newick files for phylogenetic trees are stored there. All files in an archive are zipped to make delivery and sharing convenient: the extension .qza stands for QIIME zipped artifact, and the extension .qzv stands for QIIME zipped visualization. We can use unzip, WinZip, 7-Zip, or other common tools to unzip these standard zip files. The zipped files are easily shared with peers or collaborators or submitted to journals.
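Because an archive is a standard zip file, its layout can be inspected with ordinary tools. The Python sketch below builds a small mock archive and lists what sits under its root directory. The file contents, the entry names chosen for the metadata and version information (metadata.yaml and VERSION), and the helper functions write_mock_archive and summarize_archive are assumptions of this mock; it does not produce a valid QIIME 2 artifact.

```python
import tempfile
import uuid
import zipfile
from pathlib import Path, PurePosixPath


def write_mock_archive(path, root, data_files):
    """Write a zip file that mimics a QIIME 2 archive layout (mock content only)."""
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr(f"{root}/VERSION", "mock version information\n")
        zf.writestr(f"{root}/metadata.yaml", f"uuid: {root}\n")
        zf.writestr(f"{root}/provenance/action/action.yaml", "mock provenance\n")
        for name, text in data_files.items():
            zf.writestr(f"{root}/data/{name}", text)


def summarize_archive(path):
    """Return (root directory name, sorted entries directly under the root)."""
    with zipfile.ZipFile(path) as zf:
        parts = [PurePosixPath(name).parts for name in zf.namelist()]
    roots = {p[0] for p in parts}
    if len(roots) != 1:
        raise ValueError("expected a single root directory")
    return roots.pop(), sorted({p[1] for p in parts if len(p) > 1})


root_uuid = str(uuid.uuid4())
with tempfile.TemporaryDirectory() as tmp:
    qza = Path(tmp) / "mock-tree.qza"
    write_mock_archive(qza, root_uuid, {"tree.nwk": "((A,B),C);\n"})
    root, entries = summarize_archive(qza)

print(root == root_uuid)
print(entries)
```

Pointing summarize_archive at a real .qza or .qzv file should reveal the same single-root structure, although a real archive may contain entries beyond those written by this mock.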

QIIME 2 uses data provenance tracking to record all information about the series of Actions that led to a Result, including the packages, Python dependencies, and other environment information involved, alongside the data itself. Microbiome data analyses usually generate many output files, and data provenance tracking keeps us from losing track of how each file was generated.
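The idea of tracing a Result back through the Actions that produced it can be sketched with a small graph walk. This is a toy model, not QIIME 2's provenance format: the Record class, the lineage function, and the action labels used in the example are hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical provenance record for one Result."""
    action: str
    inputs: list = field(default_factory=list)  # identifiers of input Results
    result_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def lineage(result_id, records):
    """Return the actions that led to `result_id`, earliest first."""
    seen, ordered = set(), []

    def visit(rid):
        if rid in seen:
            return
        seen.add(rid)
        record = records[rid]
        for parent in record.inputs:
            visit(parent)  # record every upstream action before this one
        ordered.append(record.action)

    visit(result_id)
    return ordered


# Made-up analysis history; the action labels are illustrative only.
imported = Record("import")
denoised = Record("denoise", [imported.result_id])
tree = Record("build-tree", [denoised.result_id])
distances = Record("beta-phylogenetic", [denoised.result_id, tree.result_id])
records = {r.result_id: r for r in (imported, denoised, tree, distances)}

print(lineage(distances.result_id, records))
# prints ['import', 'denoise', 'build-tree', 'beta-phylogenetic']
```

Each Result only needs to remember its own Action and the identifiers of its inputs; the full history of any output file can then be reconstructed by following those identifiers backward.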

1.5 Extract Data from QIIME 2 Archives

QIIME 2 .qza and .qzv files are zip file containers with a defined internal directory structure. Depending on whether QIIME 2 and the q2cli command-line interface are installed on your computer, you can access these files with the qiime tools export command or with standard decompression utilities such as unzip, WinZip, or 7-Zip. Any actual data files in a .qza artifact can be extracted directly using qiime tools export, which is basically just a wrapper for unzip. Alternatively, we can unzip the artifact directly using unzip -k file.qza and then retrieve the files from the data folder.
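When QIIME 2 is not installed, the data files can also be pulled out of an archive programmatically. The following Python sketch copies everything under the data folder of a zip archive into an output directory. The extract_data function is a hypothetical helper, and the archive it is run on is a mock built on the spot with a made-up root name and contents; the sketch assumes only that data files live under a data folder directly below the root.

```python
import tempfile
import zipfile
from pathlib import Path, PurePosixPath


def extract_data(archive_path, out_dir):
    """Copy every file under <root>/data/ in the archive into out_dir."""
    out_dir = Path(out_dir)
    written = []
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            parts = PurePosixPath(info.filename).parts
            # Expect <root>/data/<relative path>; skip everything else.
            if info.is_dir() or len(parts) < 3 or parts[1] != "data":
                continue
            target = out_dir.joinpath(*parts[2:])
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(zf.read(info))
            written.append(target)
    return written


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny mock archive; the root name and contents are made up.
    qza = Path(tmp) / "mock.qza"
    with zipfile.ZipFile(qza, "w") as zf:
        zf.writestr("mock-root/VERSION", "mock\n")
        zf.writestr("mock-root/provenance/action/action.yaml", "mock\n")
        zf.writestr("mock-root/data/tree.nwk", "((A,B),C);\n")

    files = extract_data(qza, Path(tmp) / "exported")
    names = [f.name for f in files]
    newick = files[0].read_text()

print(names)
print(newick.strip())
```

Only the contents of the data folder are written out; the provenance and version entries stay in the archive, so the original .qza file should be kept if the record of how the data were generated matters.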

1.6 Summary

In this introductory chapter on QIIME 2, we first provided an overview of QIIME 2. Then we described some core concepts in QIIME 2, including artifacts, visualizations, semantic types, and plugins. Next, we introduced how to install QIIME 2 step by step. Following that, we introduced how to store and track data in QIIME 2 archives and how to extract data from them.

In Chap. 3 of this book, we will introduce some basic data processing procedures in QIIME 2. Chapter 4 will introduce building a feature table and feature representative sequences from raw reads. Chapter 5 will introduce assigning taxonomy and building a phylogenetic tree. Chapter 6 will introduce clustering sequences into OTUs. Before we move on to these chapters, let’s introduce some general R functions and packages, as well as R packages specifically designed for microbiome data analysis, that are used in this book (Chap. 2).

References

Amir, Amnon, Daniel McDonald, Jose A. Navas-Molina, Evguenia Kopylova, James T. Morton, Zhenjiang Zech Xu, Eric P. Kightley, Luke R. Thompson, Embriette R. Hyde, Antonio Gonzalez, and Rob Knight. 2017. Deblur rapidly resolves single-nucleotide community sequence patterns. mSystems 2 (2): e00191-16. https://doi.org/10.1128/mSystems.00191-16, https://pubmed.ncbi.nlm.nih.gov/28289731, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5340863/.
Bolyen, Evan, Jai Ram Rideout, Matthew R. Dillon, Nicholas A. Bokulich, Christian C. Abnet, Gabriel A. Al-Ghalith, Harriet Alexander, Eric J. Alm, Manimozhiyan Arumugam, Francesco Asnicar, Yang Bai, Jordan E. Bisanz, Kyle Bittinger, Asker Brejnrod, Colin J. Brislawn, C. Titus Brown, Benjamin J. Callahan, Andrés Mauricio Caraballo-Rodríguez, John Chase, Emily K. Cope, Ricardo Da Silva, Christian Diener, Pieter C. Dorrestein, Gavin M. Douglas, Daniel M. Durall, Claire Duvallet, Christian F. Edwardson, Madeleine Ernst, Mehrbod Estaki, Jennifer Fouquier, Julia M. Gauglitz, Sean M. Gibbons, Deanna L. Gibson, Antonio Gonzalez, Kestrel Gorlick, Jiarong Guo, Benjamin Hillmann, Susan Holmes, Hannes Holste, Curtis Huttenhower, Gavin A. Huttley, Stefan Janssen, Alan K. Jarmusch, Lingjing Jiang, Benjamin D. Kaehler, Kyo Bin Kang, Christopher R. Keefe, Paul Keim, Scott T. Kelley, Dan Knights, Irina Koester, Tomasz Kosciolek, Jorden Kreps, Morgan G.I. Langille, Joslynn Lee, Ruth Ley, Yong-Xin Liu, Erikka Loftfield, Catherine Lozupone, Massoud Maher, Clarisse Marotz, Bryan D. Martin, Daniel McDonald, Lauren J. McIver, Alexey V. Melnik, Jessica L. Metcalf, Sydney C. Morgan, Jamie T. Morton, Ahmad Turan Naimey, Jose A. Navas-Molina, Louis Felix Nothias, Stephanie B. Orchanian, Talima Pearson, Samuel L. Peoples, Daniel Petras, Mary Lai Preuss, Elmar Pruesse, Lasse Buur Rasmussen, Adam Rivers, Michael S. Robeson 2nd, Patrick Rosenthal, Nicola Segata, Michael Shaffer, Arron Shiffer, Rashmi Sinha, Se Jin Song, John R. Spear, Austin D. Swafford, Luke R. Thompson, Pedro J. Torres, Pauline Trinh, Anupriya Tripathi, Peter J. Turnbaugh, Sabah Ul-Hasan, Justin J.J. van der Hooft, Fernando Vargas, Yoshiki Vázquez-Baeza, Emily Vogtmann, Max von Hippel, William Walters, Yunhu Wan, Mingxun Wang, Jonathan Warren, Kyle C. Weber, Charles H.D. Williamson, Amy D. Willis, Zhenjiang Zech Xu, Jesse R. Zaneveld, Yilong Zhang, Qiyun Zhu, Rob Knight, and J. Gregory Caporaso. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37 (8): 852–857. https://doi.org/10.1038/s41587-019-0209-9, https://pubmed.ncbi.nlm.nih.gov/31341288, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7015180/.
Callahan, Benjamin J., Paul J. McMurdie, Michael J. Rosen, Andrew W. Han, Amy Jo A. Johnson, and Susan P. Holmes. 2016. DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods 13 (7): 581–583. https://doi.org/10.1038/nmeth.3869.
Caporaso, J.G., J. Kuczynski, J. Stombaugh, K. Bittinger, et al. 2010. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7 (5): 335–336. https://doi.org/10.1038/nmeth.f.303.
Caporaso, J. Gregory, Kyle Bittinger, Frederic D. Bushman, Todd Z. DeSantis, Gary L. Andersen, and Rob Knight. 2009. PyNAST: A flexible tool for aligning sequences to a template alignment. Bioinformatics 26 (2): 266–267. https://doi.org/10.1093/bioinformatics/btp636.
Edgar, Robert C. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26 (19): 2460–2461. https://doi.org/10.1093/bioinformatics/btq461.
Nilakanta, Haema, Kimberly L. Drews, Suzanne Firrell, Mary A. Foulkes, and Kathleen A. Jablonski. 2014. A review of software for analyzing molecular sequences. BMC Research Notes 7 (1): 830. https://doi.org/10.1186/1756-0500-7-830.
Schloss, P.D., S.L. Westcott, T. Ryabin, J.R. Hall, M. Hartmann, E.B. Hollister, R.A. Lesniewski, B.B. Oakley, D.H. Parks, C.J. Robinson, J.W. Sahl, B. Stres, G.G. Thallinger, D.J. Van Horn, and C.F. Weber. 2009. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology 75 (23): 7537–7541. https://doi.org/10.1128/aem.01541-09.
Xia, Yinglin, Jun Sun, and Ding-Geng Chen. 2018. Bioinformatic analysis of microbiome data. In Statistical Analysis of Microbiome Data with R, 1–27. Singapore: Springer Singapore.